The following packages are required for the analysis and visualization on this page:
# General data wrangling
library(tidyverse)
library(DT)
library(readxl)
# Visualization
library(plotly)
library(colorRamps)
library(RColorBrewer)
library(colorspace)
library(NeuralNetTools)
library(gridExtra)
library(ggplotify)
library(knitr)
library(kableExtra)
# Modeling
library(glmnet)
library(caret)
library(ranger)
# Ensemble building
library(caretEnsemble)
This page focuses on developing both individual models and model ensembles to predict annual change in CDR-Sum of Boxes based on principal components (PCs) derived from regional tau-PET uptake per year, age at baseline, and sex. These data were prepared in Data Preparation and constitute the PCA-transformed dataset.
First, load in the data prepared in the previous data preparation phase:
load("../Prepared_Data.RData")
As per standard convention in model development, I will randomly partition this dataset into training and testing subsets so that the models are trained and evaluated on independent data. I will use the sample function with the seed set to 127 to split the data into 75% training and 25% testing.
# Set seed for reproducibility of the random train/test split
set.seed(127)
train.index <- sample(nrow(cortex.changes), nrow(cortex.changes)*0.75, replace=F)
# Remove unnecessary identifier info from datasets for modeling
cortex.df <- cortex.changes %>% ungroup() %>% select(-RID, -interval_num)
# Pre-processing will be applied in model training with caret
# Subset training + test data for cortical lobe-averaged tau uptake per year data
cortex.train <- cortex.df[train.index, ]
cortex.test <- cortex.df[-train.index, ]
I will be using the caretEnsemble package to compile four individual regression models into a stacked ensemble. This package enables evaluation of the models individually as well as together in the ensemble.
The first step is to create a caretList of the four regression models I will use:
glmnet (elastic net), knn (k-nearest neighbors), nnet (neural network), and ranger (random forest). I will use the trainControl function to specify ten-fold cross-validation with parallel processing.
# trainControl for 10-fold CV and parallel processing
ensemble.control <- trainControl(method="cv", number=10, allowParallel=T)
I’m using RMSE as the metric by which model parameters are tuned. All predictors will be centered and scaled via the preProcess option; the same training-derived transformation is applied automatically to any new data (including the test set) at prediction time.
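For reference, the RMSE being minimized during tuning is simply the root of the mean squared residual; a minimal base-R sketch:

```r
# Root-mean-square error, the metric caret minimizes during tuning
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Toy example: residuals of 0, 0, and -2 give sqrt(4/3)
rmse(c(1, 2, 3), c(1, 2, 5))  # ~1.1547
```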
# Set seed for consistency
set.seed(127)
# Neural net -- linout=T means linear output (i.e. not constrained to be [0,1]),
# try a range of hidden layer sizes and weight decays
my.neuralnet <- caretModelSpec(method="nnet", linout=T, trace=F, tuneGrid = expand.grid(size=c(1, 3, 5, 10, 15), decay = seq(0, 1, by=0.2)))
# k-nearest neighbors: try 20 different values of k
my.knn <- caretModelSpec(method="knn", tuneLength=20)
# random forest: try 15 different numbers of features considered at each node and use 500 sampled trees
my.randomforest <- caretModelSpec(method="ranger", tuneLength=15, num.trees=500, importance="permutation")
# elastic net: try four different values of alpha for ridge/lasso blending and four lambda values for coefficient penalty
my.elasticnet <- caretModelSpec(method="glmnet",tuneGrid=expand.grid(alpha=c(0,0.1,0.6,1), lambda=c(5^-5,5^-3,5^-1,1)))
# Compile individual models into one cohesive model list using caretList
invisible(capture.output(ensemble.models <- caretList(CDRSB ~ .,
data=cortex.train,
trControl=ensemble.control,
metric="RMSE",
preProcess=c("center", "scale"),
tuneList=list(my.neuralnet, my.knn, my.randomforest, my.elasticnet))))
The final chosen parameters for each model can be viewed:
Elastic net
# Print elastic net model summary
ensemble.models$glmnet
# Elastic net cross-validation results
glmnet.alpha <- ensemble.models$glmnet$results$alpha
glmnet.rmse <- ensemble.models$glmnet$results$RMSE
glmnet.mae <- ensemble.models$glmnet$results$MAE
glmnet.lambda <- ensemble.models$glmnet$results$lambda
# Plot the RMSE and MAE in a facet plot using facet_wrap
p.glmnet.cv <- data.frame(alpha=glmnet.alpha, RMSE=glmnet.rmse, MAE=glmnet.mae, lambda=glmnet.lambda) %>%
mutate(alpha=as.character(alpha)) %>%
pivot_longer(cols=c(RMSE, MAE), names_to="Metric", values_to="Value") %>%
mutate(Metric = ifelse(Metric=="MAE", "Mean Absolute Error", "Root-Mean Square Error")) %>%
# One line per value of alpha
ggplot(data=., mapping=aes(x=lambda, y=Value, color=alpha)) +
# Facet the MAE and RMSE in separate plots
facet_wrap(Metric ~ ., scales="free") +
geom_point() +
geom_line(aes(group=alpha)) +
theme_minimal() +
ggtitle("Elastic Net Regression Cross-Validated Results") +
theme(plot.title=element_text(hjust=0.5),
axis.title=element_blank(),
panel.spacing = unit(2, "lines"))
# Convert to interactive plotly
ggplotly(p.glmnet.cv) %>%
layout(yaxis = list(title = "MAE",
titlefont = list(size = 12)),
xaxis = list(title = "lambda",
titlefont = list(size = 12)),
yaxis2 = list(title = "RMSE",
titlefont = list(size = 12)),
xaxis2 = list(title = "lambda",
titlefont = list(size = 12)),
autosize = F, width = 900, height = 400)
# Clear workspace
rm(glmnet.alpha, glmnet.rmse, glmnet.lambda, glmnet.mae, p.glmnet.cv, train.index)
## glmnet
##
## 246 samples
## 8 predictor
##
## Pre-processing: centered (8), scaled (8)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 221, 222, 221, 222, 222, 221, ...
## Resampling results across tuning parameters:
##
## alpha lambda RMSE Rsquared MAE
## 0.0 0.00032 1.094799 0.15609471 0.7012543
## 0.0 0.00800 1.094799 0.15609427 0.7012543
## 0.0 0.20000 1.094401 0.13804665 0.6929446
## 0.0 1.00000 1.089264 0.13299917 0.6888115
## 0.1 0.00032 1.097088 0.15666046 0.7034021
## 0.1 0.00800 1.096919 0.15533476 0.7021739
## 0.1 0.20000 1.095433 0.14082072 0.6910526
## 0.1 1.00000 1.082621 0.09731275 0.6910564
## 0.6 0.00032 1.097076 0.15671643 0.7033811
## 0.6 0.00800 1.096898 0.15299436 0.7006980
## 0.6 0.20000 1.086487 0.10555798 0.6936131
## 0.6 1.00000 1.072871 NaN 0.6855987
## 1.0 0.00032 1.097177 0.15666515 0.7033982
## 1.0 0.00800 1.097156 0.14965997 0.6997010
## 1.0 0.20000 1.073568 0.27522992 0.6859727
## 1.0 1.00000 1.072871 NaN 0.6855987
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 0.6 and lambda = 1.
Right off the bat, I notice that the Rsquared value is NaN for the selected parameters (alpha=0.6 and lambda=1). I suspect this is because all coefficients aside from the intercept were shrunk to zero, leaving constant predictions with no variance on which to compute the \(R^2\) value. This will become clearer through visualizing the results of these models and the overarching ensemble.
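One way to check this suspicion is to pull the glmnet coefficients at the selected penalty; a sketch assuming the fitted caretList object from above. If only the intercept survives, the predictions are constant and \(R^2\) is undefined:

```r
# Extract coefficients at the tuned lambda from the final glmnet fit --
# a column of zeros below the intercept would confirm total shrinkage
coef(ensemble.models$glmnet$finalModel,
     s = ensemble.models$glmnet$bestTune$lambda)
```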
kNN
# Print kNN model summary
ensemble.models$knn
# kNN cross-validation plot
knn.k <- ensemble.models$knn$results$k
knn.rmse <- ensemble.models$knn$results$RMSE
knn.mae <- ensemble.models$knn$results$MAE
# Plot the RMSE and MAE in a facet plot using facet_wrap
p.knn.cv <- data.frame(k=knn.k, RMSE=knn.rmse, MAE=knn.mae) %>%
pivot_longer(cols=c(RMSE, MAE), names_to="Metric", values_to="Value") %>%
mutate(Metric = ifelse(Metric=="MAE", "Mean Absolute Error", "Root-Mean Square Error")) %>%
# One line per value of alpha
ggplot(data=., mapping=aes(x=k, y=Value, color=Metric)) +
# Facet the MAE and RMSE in separate plots
facet_wrap(Metric ~ ., scales="free") +
geom_point() +
geom_line() +
theme_minimal() +
ggtitle("kNN Regression Cross-Validated Results") +
theme(plot.title=element_text(hjust=0.5),
axis.title=element_blank(),
panel.spacing = unit(2, "lines"))
# Convert to interactive plotly
ggplotly(p.knn.cv) %>%
layout(yaxis = list(title = "MAE",
titlefont = list(size = 12)),
xaxis = list(title = "k",
titlefont = list(size = 12)),
yaxis2 = list(title = "RMSE",
titlefont = list(size = 12)),
xaxis2 = list(title = "k",
titlefont = list(size = 12)),
autosize = F, width = 900, height = 400)
rm(knn.rmse, knn.k, p.knn.cv, knn.mae)
## k-Nearest Neighbors
##
## 246 samples
## 8 predictor
##
## Pre-processing: centered (8), scaled (8)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 221, 222, 221, 222, 222, 221, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 1.107192 0.07302604 0.6501571
## 7 1.103811 0.05925652 0.6537839
## 9 1.092773 0.03955904 0.6489966
## 11 1.080964 0.04281474 0.6400736
## 13 1.063219 0.05810530 0.6249247
## 15 1.054257 0.06719400 0.6222031
## 17 1.063972 0.05253539 0.6255530
## 19 1.067825 0.04350845 0.6238405
## 21 1.066776 0.04900117 0.6226638
## 23 1.073400 0.04351178 0.6239306
## 25 1.070993 0.04368831 0.6199548
## 27 1.071035 0.04792044 0.6177739
## 29 1.069877 0.05075853 0.6172153
## 31 1.069343 0.04840288 0.6143285
## 33 1.077242 0.04448998 0.6180380
## 35 1.076238 0.04480808 0.6152146
## 37 1.078176 0.04614663 0.6150701
## 39 1.083048 0.03764211 0.6205318
## 41 1.081515 0.03624130 0.6188928
## 43 1.079126 0.03179071 0.6168043
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 15.
The optimal value of k based on RMSE is k=15, with an associated RMSE of 1.054257. Of note, the MAE is actually lowest at k=31, with an MAE of 0.6143285, but the difference is very small and likely inconsequential.
Neural Network
# Print neural network model summary
ensemble.models$nnet
# Neural network cross-validation plot
n.neurons <- ensemble.models$nnet$results$size
nnet.rmse <- ensemble.models$nnet$results$RMSE
nnet.mae <- ensemble.models$nnet$results$MAE
nnet.weight <- ensemble.models$nnet$results$decay
# Plot the RMSE and MAE in a facet plot using facet_wrap
p.nnet.cv <- data.frame(n.neurons, RMSE=nnet.rmse, MAE=nnet.mae, decay=nnet.weight) %>%
mutate(decay=as.character(decay)) %>%
pivot_longer(cols=c(RMSE, MAE), names_to="Metric", values_to="Value") %>%
mutate(Metric = ifelse(Metric=="MAE", "Mean Absolute Error", "Root-Mean Square Error")) %>%
# One line per value of alpha
ggplot(data=., mapping=aes(x=n.neurons, y=Value, color=decay)) +
# Facet the MAE and RMSE in separate plots
facet_wrap(Metric ~ ., scales="free") +
geom_point() +
geom_line(aes(group=decay)) +
theme_minimal() +
ggtitle("Neural Network Regression Cross-Validated Results") +
theme(plot.title=element_text(hjust=0.5),
axis.title=element_blank(),
panel.spacing = unit(2, "lines"))
# Convert to interactive plotly
ggplotly(p.nnet.cv) %>%
layout(yaxis = list(title = "MAE",
titlefont = list(size = 12)),
xaxis = list(title = "# Neurons in Hidden Layer",
titlefont = list(size = 12)),
yaxis2 = list(title = "RMSE",
titlefont = list(size = 12)),
xaxis2 = list(title = "# Neurons in Hidden Layer",
titlefont = list(size = 12)),
autosize = F, width = 900, height = 400)
rm(n.neurons, nnet.rmse, nnet.mae, nnet.weight, p.nnet.cv)
## Neural Network
##
## 246 samples
## 8 predictor
##
## Pre-processing: centered (8), scaled (8)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 221, 222, 221, 222, 222, 221, ...
## Resampling results across tuning parameters:
##
## size decay RMSE Rsquared MAE
## 1 0.0 1.245690 0.15166462 0.7053475
## 1 0.2 1.198075 0.09028417 0.7142961
## 1 0.4 1.169390 0.10492935 0.7057446
## 1 0.6 1.141485 0.12019571 0.7016722
## 1 0.8 1.110522 0.16281964 0.6976585
## 1 1.0 1.107683 0.16027783 0.6970955
## 3 0.0 1.712916 0.13281912 0.9139845
## 3 0.2 1.251255 0.03420884 0.7894735
## 3 0.4 1.198274 0.10061319 0.7599557
## 3 0.6 1.159946 0.06386280 0.7331481
## 3 0.8 1.188469 0.06028438 0.7321538
## 3 1.0 1.133342 0.07567047 0.6947239
## 5 0.0 10.468680 0.10319745 2.7156369
## 5 0.2 1.447784 0.09856635 0.8949424
## 5 0.4 1.365099 0.09001070 0.8732478
## 5 0.6 1.218252 0.06958312 0.7633906
## 5 0.8 1.196107 0.10006494 0.7481264
## 5 1.0 1.182999 0.10469178 0.7279717
## 10 0.0 1.998661 0.07082981 1.2652291
## 10 0.2 1.453192 0.06660054 0.9800577
## 10 0.4 1.310242 0.03110645 0.8858864
## 10 0.6 1.255780 0.09319997 0.8084398
## 10 0.8 1.234862 0.07430756 0.7592734
## 10 1.0 1.187241 0.11712181 0.7243907
## 15 0.0 1.741426 0.05719791 1.2122015
## 15 0.2 1.474091 0.07627155 1.0129055
## 15 0.4 1.371093 0.03945652 0.9030756
## 15 0.6 1.305277 0.08997026 0.8305314
## 15 0.8 1.227082 0.12982494 0.7615322
## 15 1.0 1.171387 0.08693675 0.7287818
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 1 and decay = 1.
The optimal neural network includes one neuron in the hidden layer, which receives input from all eight input nodes (the six cortical lobe tau-uptake change measures, baseline age, and sex) and outputs onto the final prediction node with the addition of a bias term. Here’s a graphical representation of this caret-trained neural network:
par(mar = numeric(4))
plotnet(ensemble.models$nnet, cex_val=0.8, pad_x=0.6, pos_col="firebrick3", neg_col="dodgerblue4",
circle_col="lightslategray", bord_col="lightslategray", alpha_val=0.4)
Age, sex (male), cingulate cortex, parietal cortex, and temporal cortex all exert positive weights on the hidden layer, while frontal, insular, and occipital cortices exert negative weights on the hidden layer. This may suggest that the cingulate, parietal, and temporal cortices show more similar SUVR changes than the frontal, insular, or occipital cortices.
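The underlying weights can also be read off numerically rather than visually; a sketch using NeuralNetTools (already loaded above), applied to the fitted nnet object:

```r
# neuralweights() returns the network structure and the raw weight vectors,
# grouped by destination node (e.g. the hidden neuron, then the output node)
neuralweights(ensemble.models$nnet$finalModel)
```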
Random Forest
# Print random forest model summary
ensemble.models$ranger
# Random forest cross-validation plot
splitrule <- ensemble.models$ranger$results$splitrule
numpred <- ensemble.models$ranger$results$mtry
rf.rmse <- ensemble.models$ranger$results$RMSE
rf.mae <- ensemble.models$ranger$results$MAE
# Plot the RMSE and MAE in a facet plot using facet_wrap
p.rf.cv <- data.frame(splitrule, RMSE=rf.rmse, MAE=rf.mae, numpred) %>%
pivot_longer(cols=c(RMSE, MAE), names_to="Metric", values_to="Value") %>%
mutate(Metric = ifelse(Metric=="MAE", "Mean Absolute Error", "Root-Mean Square Error")) %>%
# One line per value of alpha
ggplot(data=., mapping=aes(x=numpred, y=Value, color=splitrule)) +
# Facet the MAE and RMSE in separate plots
facet_wrap(Metric ~ ., scales="free") +
geom_point() +
geom_line(aes(group=splitrule)) +
theme_minimal() +
ggtitle("Random Forest Regression Cross-Validated Results") +
theme(plot.title=element_text(hjust=0.5),
axis.title=element_blank(),
panel.spacing = unit(2, "lines"))
# Convert to interactive plotly
ggplotly(p.rf.cv) %>%
layout(yaxis = list(title = "MAE",
titlefont = list(size = 12)),
xaxis = list(title = "# Predictors in Decision Node",
titlefont = list(size = 12)),
yaxis2 = list(title = "RMSE",
titlefont = list(size = 12)),
xaxis2 = list(title = "# Predictors in Decision Node",
titlefont = list(size = 12)),
autosize = F, width = 900, height = 400)
rm(splitrule, numpred, rf.rmse, rf.mae, p.rf.cv)
## Random Forest
##
## 246 samples
## 8 predictor
##
## Pre-processing: centered (8), scaled (8)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 221, 222, 221, 222, 222, 221, ...
## Resampling results across tuning parameters:
##
## mtry splitrule RMSE Rsquared MAE
## 2 variance 1.133259 0.08268213 0.7057911
## 2 extratrees 1.079462 0.06854085 0.6725601
## 3 variance 1.171525 0.09425731 0.7236953
## 3 extratrees 1.098156 0.06807368 0.6828059
## 4 variance 1.182269 0.08854006 0.7337961
## 4 extratrees 1.098247 0.07611410 0.6831745
## 5 variance 1.186409 0.08702435 0.7325433
## 5 extratrees 1.106771 0.08655482 0.6916539
## 6 variance 1.203915 0.08312578 0.7374184
## 6 extratrees 1.105800 0.09635742 0.6947309
## 7 variance 1.205372 0.08462786 0.7381463
## 7 extratrees 1.113361 0.08141956 0.6982263
## 8 variance 1.219286 0.08633762 0.7451829
## 8 extratrees 1.112603 0.10405555 0.6954723
##
## Tuning parameter 'min.node.size' was held constant at a value of 5
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were mtry = 2, splitrule = extratrees and min.node.size = 5.
The RMSE and MAE are lowest when only two predictors are included at each decision node. The error is consistently higher when the splitrule is by variance, so the splitrule of extratrees is selected. Of note, the minimum node size was held at the default value of five.
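If min.node.size were also of interest, an explicit tuneGrid could replace tuneLength; a sketch with illustrative (untested) values covering all three ranger tuning parameters:

```r
# Explicit grid over mtry, splitrule, and min.node.size
rf.grid <- expand.grid(mtry = 2:8,
                       splitrule = c("variance", "extratrees"),
                       min.node.size = c(3, 5, 10))
# Hypothetical alternative spec to my.randomforest above
my.randomforest.full <- caretModelSpec(method = "ranger", tuneGrid = rf.grid,
                                       num.trees = 500, importance = "permutation")
```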
These four models can be resampled and aggregated using the resamples function:
# Resample the performance of this ensemble and report summary metrics for MAE, RMSE, and R2
set.seed(127)
ensemble.results <- resamples(ensemble.models)
summary(ensemble.results)
##
## Call:
## summary.resamples(object = ensemble.results)
##
## Models: nnet, knn, ranger, glmnet
## Number of resamples: 10
##
## MAE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## nnet 0.4559073 0.5362215 0.7502167 0.6970955 0.7991874 0.9266854 0
## knn 0.3657926 0.4400819 0.6808184 0.6222031 0.7410961 0.9337061 0
## ranger 0.4617867 0.5156474 0.7147103 0.6725601 0.7819681 0.9799226 0
## glmnet 0.4474888 0.5692098 0.7479911 0.6855987 0.7901979 0.9026879 0
##
## RMSE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## nnet 0.5445679 0.7521224 1.173401 1.107683 1.308853 1.829196 0
## knn 0.4652121 0.6544561 1.088401 1.054257 1.401174 1.783437 0
## ranger 0.5581244 0.7150456 1.118856 1.079462 1.353021 1.865138 0
## glmnet 0.5007390 0.7217365 1.101043 1.072871 1.375093 1.818658 0
##
## Rsquared
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## nnet 0.0011153793 0.008785862 0.17664760 0.16027783 0.27709588 0.3569507 0
## knn 0.0006141381 0.001677300 0.01897170 0.06719400 0.10810592 0.2946562 0
## ranger 0.0014000337 0.014789597 0.05079104 0.06854085 0.08740914 0.2780762 0
## glmnet NA NA NA NaN NA NA 10
Looking at the error statistics across resampling iterations, the knn model has the lowest mean RMSE and mean MAE, while the neural network has the highest of both. As noted above, the \(R^2\) values are all NA for the elastic net (glmnet), suggesting no predictors were retained after regularization.
It’s also useful to look at how strongly the component models’ resampled errors correlate:
cat("\nRoot-Mean Square Error Correlation\n")
# Calculate model RMSE correlations
modelCor(ensemble.results, metric="RMSE")
cat("\n\nMean Absolute Error Correlation\n")
# Calculate model MAE correlations
modelCor(ensemble.results, metric="MAE")
##
## Root-Mean Square Error Correlation
## nnet knn ranger glmnet
## nnet 1.0000000 0.9466783 0.9745663 0.9587457
## knn 0.9466783 1.0000000 0.9873108 0.9923165
## ranger 0.9745663 0.9873108 1.0000000 0.9911293
## glmnet 0.9587457 0.9923165 0.9911293 1.0000000
##
##
## Mean Absolute Error Correlation
## nnet knn ranger glmnet
## nnet 1.0000000 0.9258331 0.9281142 0.9514522
## knn 0.9258331 1.0000000 0.9806123 0.9811033
## ranger 0.9281142 0.9806123 1.0000000 0.9678099
## glmnet 0.9514522 0.9811033 0.9678099 1.0000000
All four models are very strongly correlated (R>0.9) for both RMSE and MAE, suggesting they will suffer from the same predictive pitfalls. This is certainly not ideal, but does not preclude continued analysis. These error metrics can be visualized with confidence intervals using the dotplot function:
# Plot the four models' RMSE values
p.rmse <- as.ggplot(dotplot(ensemble.results, metric="RMSE"))
# Plot the four models' MAE values
p.mae <- as.ggplot(dotplot(ensemble.results, metric="MAE"))
# Combine plots side-by-side
grid.arrange(p.rmse, p.mae, ncol=2)
These four models show very comparable RMSE confidence intervals. kNN shows a somewhat lower mean absolute error compared to the other three models, but the confidence interval bars overlap considerably.
Despite the large error correlation between the models, I’ll move forward to see if ensembling strengthens the overall predictions for the change in CDR-Sum of Boxes over time. Starting with the basic caretEnsemble function, which by default employs a generalized linear model to combine the component models:
set.seed(127)
# Set trainControl --> repeated 10-fold CV (1 repeat by default) with parallel processing
ensemble.control <- trainControl(method="repeatedcv", number=10, allowParallel = T)
# Create stacked ensemble, optimizing for RMSE
stacked.ensemble.glm <- caretEnsemble(ensemble.models, metric="RMSE", trControl=ensemble.control)
summary(stacked.ensemble.glm)
## The following models were ensembled: nnet, knn, ranger, glmnet
## They were weighted:
## 2.9362 0.1623 0.9061 -0.2833 -7.7576
## The resulting RMSE is: 1.105
## The fit for each individual model on the RMSE is:
## method RMSE RMSESD
## nnet 1.107683 0.4105291
## knn 1.054257 0.4469817
## ranger 1.079462 0.4234083
## glmnet 1.072871 0.4254785
After stacking with a generalized linear model combination, kNN still shows the lowest cross-validated RMSE of the four component models. The caretEnsemble package provides an autoplot method to create in-depth diagnostic plots for ensemble models:
autoplot(stacked.ensemble.glm)
The top-left panel shows the mean cross-validated RMSE per model, with the bars denoting the RMSE standard deviation. Interestingly, the elastic net and kNN regression models show lower RMSE values than the overall ensemble. The middle-left panel shows the relative weights ascribed to each model in the ensemble; glmnet received a large negative weight (likely due to no predictor variables being retained), while kNN received the greatest positive weight.
Stacked ensembles can also be constructed using caretStack, which combines the constituent models with any user-specified caret method as the meta-learner. I’ll try one using a random forest meta-learner and one using an elastic net meta-learner.
set.seed(127)
# Create random forest-combined stacked ensemble
stacked.ensemble.rf <- caretStack(ensemble.models, method = "rf", metric = "RMSE", trControl = ensemble.control)
# Create elastic net-combined stacked ensemble
stacked.ensemble.glmnet <- caretStack(ensemble.models, method="glmnet", metric="RMSE", trControl = ensemble.control)
The summary statistics for each model can be displayed:
cat("\nStacked ensemble, generalized linear model:\n")
stacked.ensemble.glm
cat("\n\nStacked ensemble, random forest:\n")
stacked.ensemble.rf
cat("\n\nStacked ensemble, elastic net:\n")
stacked.ensemble.glmnet
##
## Stacked ensemble, generalized linear model:
## A glm ensemble of 4 base models: nnet, knn, ranger, glmnet
##
## Ensemble results:
## Generalized Linear Model
##
## 246 samples
## 4 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 222, 221, 222, 222, 221, 220, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 1.105009 0.05202703 0.6987817
##
##
##
## Stacked ensemble, random forest:
## A rf ensemble of 4 base models: nnet, knn, ranger, glmnet
##
## Ensemble results:
## Random Forest
##
## 246 samples
## 4 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 222, 221, 222, 222, 221, 220, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 2 1.176735 0.04518570 0.7354206
## 3 1.186757 0.06273812 0.7392686
## 4 1.186169 0.07393237 0.7356158
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 2.
##
##
## Stacked ensemble, elastic net:
## A glmnet ensemble of 4 base models: nnet, knn, ranger, glmnet
##
## Ensemble results:
## glmnet
##
## 246 samples
## 4 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 222, 221, 220, 222, 221, 221, ...
## Resampling results across tuning parameters:
##
## alpha lambda RMSE Rsquared MAE
## 0.10 0.0003759611 1.081113 0.05213553 0.6930661
## 0.10 0.0037596112 1.080933 0.05217147 0.6928954
## 0.10 0.0375961122 1.077394 0.05382963 0.6895900
## 0.55 0.0003759611 1.081104 0.05217397 0.6930713
## 0.55 0.0037596112 1.080092 0.05285595 0.6920865
## 0.55 0.0375961122 1.071503 0.05622865 0.6847089
## 1.00 0.0003759611 1.081109 0.05217428 0.6930760
## 1.00 0.0037596112 1.079285 0.05350495 0.6913866
## 1.00 0.0375961122 1.068036 0.05924443 0.6840631
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 1 and lambda = 0.03759611.
These three stacked ensembles show fairly comparable RMSE and MAE values. Both metrics are slightly larger for the random forest-combined ensemble than for the other two, though this difference may not be significant and may not translate to out-of-sample data.
These three ensemble models (glm-, rf-, and glmnet-combined) will be used to predict annual change in CDR-Sum of Boxes in the held-out test dataset. Note: since the models were trained with center and scale pre-processing specified, the test data does not need to be manually pre-processed.
Model predictions using training data:
# Predict based on training data for the four individual models
# And the three stacked ensemble models
glmnet.train <- predict.train(ensemble.models$glmnet)
knn.train <- predict.train(ensemble.models$knn)
nnet.train <- predict.train(ensemble.models$nnet)
rf.train <- predict(ensemble.models$ranger)
ensemble.glm.train <- predict(stacked.ensemble.glm)
ensemble.glmnet.train <- predict(stacked.ensemble.glmnet)
ensemble.rf.train <- predict(stacked.ensemble.rf)
real.train <- cortex.train$CDRSB
# Combine these predictions into a dataframe for easier viewing
train.df <- do.call(cbind, list(elastic.net=glmnet.train, knn=knn.train, neural.net=nnet.train, random.forest=rf.train,
ensemble.glm=ensemble.glm.train, ensemble.glmnet=ensemble.glmnet.train,
ensemble.rf=ensemble.rf.train, real.CDR=real.train)) %>% as.data.frame()
datatable(train.df %>% mutate_if(is.numeric, function(x) round(x,4)) %>% select(real.CDR, elastic.net:ensemble.rf))
The glmnet elastic net model predicted the same outcome value (0.3539) for every single training case. Together with the NaN \(R^2\) values, this suggests no predictor variables were retained. This will be further explored in Model Evaluation through variable importance analysis.
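A quick sanity check on the constant-prediction claim, assuming the glmnet.train vector created above: the spread of the predictions should be essentially zero.

```r
# Constant predictions have (near-)zero standard deviation
sd(glmnet.train)

# And only one distinct value after rounding
length(unique(round(glmnet.train, 4)))
```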
Model predictions using test data:
# Predict based on UNSEEN test data for the four individual models
# And the three stacked ensemble models
real.test <- cortex.test$CDRSB
glmnet.test <- predict.train(ensemble.models$glmnet, newdata=cortex.test)
knn.test <- predict.train(ensemble.models$knn, newdata=cortex.test)
nnet.test <- predict.train(ensemble.models$nnet, newdata=cortex.test)
rf.test <- predict.train(ensemble.models$ranger, newdata=cortex.test)
ensemble.glm.test <- predict(stacked.ensemble.glm, newdata=cortex.test)
ensemble.glmnet.test <- predict(stacked.ensemble.glmnet, newdata=cortex.test)
ensemble.rf.test <- predict(stacked.ensemble.rf, newdata=cortex.test)
# Combine these predictions into a dataframe for easier viewing
test.df <- do.call(cbind, list(real.CDR=real.test, elastic.net=glmnet.test, knn=knn.test, neural.net=nnet.test, random.forest=rf.test,
ensemble.glm=ensemble.glm.test, ensemble.glmnet=ensemble.glmnet.test, ensemble.rf=ensemble.rf.test)) %>% as.data.frame()
datatable(test.df %>% mutate_if(is.numeric, function(x) round(x,4)))
I want to compare these three ensemble models in terms of how their predictions relate to the actual CDR-SoB values in the training and testing data. I also want to compare these results with those obtained from the individual component models to see if constructing the ensemble confers a predictive advantage. To quantify the association between real and model-predicted CDR-SoB values, I will use the \(R^2\) and RMSE values via the R2 and RMSE functions from caret:
# Calculate the RMSE between real vs. predicted CDR-SoB values for training data
# Combine into dataframe for easier viewing
rmse.train <- data.frame(ensemble.glmnet=RMSE(ensemble.glmnet.train, real.train),
ensemble.glm=RMSE(ensemble.glm.train, real.train),
ensemble.rf=RMSE(ensemble.rf.train, real.train),
elastic.net=RMSE(glmnet.train, real.train),
knn=RMSE(knn.train, real.train),
neural.net=RMSE(nnet.train, real.train),
random.forest=RMSE(rf.train, real.train),
Metric="Train_RMSE")
# Calculate the RMSE between real vs. predicted CDR-SoB values for unseen test data
# Combine into dataframe for easier viewing
rmse.test <- data.frame(ensemble.glmnet=RMSE(ensemble.glmnet.test, real.test),
ensemble.glm=RMSE(ensemble.glm.test, real.test),
ensemble.rf=RMSE(ensemble.rf.test, real.test),
elastic.net=RMSE(glmnet.test, real.test),
knn=RMSE(knn.test, real.test),
neural.net=RMSE(nnet.test, real.test),
random.forest=RMSE(rf.test, real.test),
Metric="Test_RMSE")
# Calculate the R-squared between real vs. predicted CDR-SoB values for training data
# Combine into dataframe for easier viewing
r2.train <- data.frame(ensemble.glmnet=R2(ensemble.glmnet.train, real.train),
ensemble.glm=R2(ensemble.glm.train, real.train),
ensemble.rf=R2(ensemble.rf.train, real.train),
elastic.net=R2(glmnet.train, real.train),
knn=R2(knn.train, real.train),
neural.net=R2(nnet.train, real.train),
random.forest=R2(rf.train, real.train),
Metric="Train_R2")
# Calculate the R-squared between real vs. predicted CDR-SoB values for unseen test data
# Combine into dataframe for easier viewing
r2.test <- data.frame(ensemble.glmnet=R2(ensemble.glmnet.test, real.test),
ensemble.glm=R2(ensemble.glm.test, real.test),
ensemble.rf=R2(ensemble.rf.test, real.test),
elastic.net=R2(glmnet.test, real.test),
knn=R2(knn.test, real.test),
neural.net=R2(nnet.test, real.test),
random.forest=R2(rf.test, real.test),
Metric="Test_R2")
# Combine all four prediction dataframes into one table to compare and contrast
# RMSE and R-squared across models
do.call(plyr::rbind.fill, list(rmse.train, rmse.test, r2.train, r2.test)) %>%
pivot_longer(cols=c(-Metric), names_to="Model", values_to="Value") %>%
mutate(Metric = str_replace(Metric, "_", " ")) %>%
pivot_wider(id_cols="Model", names_from="Metric", values_from="Value") %>%
mutate_if(is.numeric, function(x) round(x,4)) %>%
kable(., booktabs=T) %>% kable_styling(full_width=F)
| Model | Train RMSE | Test RMSE | Train R2 | Test R2 |
|---|---|---|---|---|
| ensemble.glmnet | 1.0955 | 0.9042 | 0.1413 | 0.0007 |
| ensemble.glm | 1.1884 | 0.9151 | 0.0290 | 0.0000 |
| ensemble.rf | 1.1162 | 0.9163 | 0.0523 | 0.0038 |
| elastic.net | 1.1418 | 0.8978 | NA | NA |
| knn | 1.0789 | 0.9357 | 0.1413 | 0.0007 |
| neural.net | 1.0769 | 0.8896 | 0.1235 | 0.0424 |
| random.forest | 0.6912 | 0.8785 | 0.8890 | 0.0481 |
The \(R^2\) values are all NA for the elastic net model because all its predictions are identical. The random forest has the lowest RMSE and the highest \(R^2\) in both the training and test data. However, its \(R^2\) falls from 0.8890 in the training dataset to 0.0481 in the test dataset, suggesting major overfitting.
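The overfitting can be summarized as a train-to-test \(R^2\) gap per model; a sketch assuming the r2.train and r2.test data frames created above:

```r
# Difference between training and test R-squared per model; large positive
# values indicate overfitting (random.forest should dominate here)
num.cols <- names(r2.train) != "Metric"
r2.gap <- unlist(r2.train[, num.cols]) - unlist(r2.test[, num.cols])
sort(r2.gap, decreasing = TRUE)
```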
I will also use scatterplot visualizations for comparison:
# Create training data ggplot to be converted to interactive plotly plot
p.train <- train.df %>%
# Reshape to facet on model
pivot_longer(cols=c(-real.CDR), names_to="Model", values_to="Prediction") %>%
mutate(Model=factor(Model, levels=c("ensemble.glmnet", "ensemble.glm", "ensemble.rf", "elastic.net", "knn", "neural.net", "random.forest"))) %>%
ggplot(data=., mapping=aes(x=real.CDR, y= Prediction, color=Model)) +
geom_point(alpha=0.3) +
facet_grid(.~Model, scales="free") +
ggtitle("Model Predictions for CDR Sum of Boxes Annual Change in Training Data") +
theme_minimal() +
theme(legend.position="none") +
ylab("Predicted CDR-SoB Change") +
xlab("Actual CDR-SoB Change") +
theme(plot.title=element_text(hjust=0.5))
# Create test data ggplot to be converted to interactive plotly plot
p.test <- test.df %>%
# Reshape to facet on model
pivot_longer(cols=c(-real.CDR), names_to="Model", values_to="Prediction") %>%
mutate(Model=factor(Model, levels=c("ensemble.glmnet", "ensemble.glm", "ensemble.rf", "elastic.net", "knn", "neural.net", "random.forest"))) %>%
ggplot(data=., mapping=aes(x=real.CDR, y= Prediction, color=Model)) +
geom_point(alpha=0.3) +
facet_grid(.~Model, scales="free") +
ggtitle("Model Predictions for CDR Sum of Boxes Annual Change in Test Data") +
theme_minimal() +
theme(legend.position="none") +
ylab("Predicted CDR-SoB Change") +
xlab("Actual CDR-SoB Change") +
theme(plot.title=element_text(hjust=0.5))
# Use ggplotly to create interactive HTML plots
ggplotly(p.train, height=350, width=900)
ggplotly(p.test, height=350, width=900)
The random forest-predicted CDR-SoB change values are pretty spot-on in the training dataset, but not in the test dataset. All of the models (barring the elastic net) over-estimate many of the test cases for which the CDR-SoB change was actually zero, and conversely tend to under-estimate the CDR change for test cases with higher CDR-SoB changes (in the 1.5-4 range).
# Compile RMSE and R2 results comparing real vs. predicted values for ensembles and component models
overall.ensemble.results <- do.call(plyr::rbind.fill, list(rmse.train, rmse.test, r2.train, r2.test)) %>%
# Reshape to facet on metric -- i.e. RMSE or R2
pivot_longer(cols=c(-Metric), names_to="Model", values_to="Value") %>%
separate(Metric, into=c("Data", "Metric"), sep="_")
p.ensemble.r2.rmse <- overall.ensemble.results %>%
mutate(Metric = ifelse(Metric=="RMSE", "Real vs. Predicted CDR-SoB RMSE", "Real vs. Predicted CDR-SoB R2")) %>%
mutate(Data=factor(Data, levels=c("Train", "Test")),
Model=factor(Model, levels=c("ensemble.glmnet", "ensemble.glm", "ensemble.rf", "elastic.net", "knn", "neural.net", "random.forest"))) %>%
ggplot(data=., mapping=aes(x=Data, y=Value, color=Model, group=Model)) +
geom_point() +
geom_line() +
theme_minimal() +
facet_wrap(Metric~., scales="free", nrow=1) +
theme(strip.text=element_text(size=12, face="bold"),
axis.title=element_blank())
# Convert to interactive plotly plot, rename x/y axis titles
ggplotly(p.ensemble.r2.rmse) %>%
layout(yaxis = list(title = "R2",
titlefont = list(size = 12)),
xaxis = list(title = "Data Subset",
titlefont = list(size = 12)),
yaxis2 = list(title = "RMSE",
titlefont = list(size = 12)),
xaxis2 = list(title = "Data Subset",
titlefont = list(size = 12)),
autosize = F, width = 900, height = 400)
These plots highlight the strong performance of the random forest model in the training dataset, which does not carry over to the test dataset. The other models, individual and ensemble alike, are very close on both \(R^2\) and RMSE between predicted vs. actual CDR-Sum of Boxes rate of change.